Abstract
Advances in large language models (LLMs) have driven progress on video understanding tasks by combining LLMs with visual models. However, most existing LLM-based models (e.g., VideoLLaMA, VideoChat) are constrained to processing short-duration videos. Recent work attempts to understand long-term videos by extracting and compressing visual features into a fixed-size memory. Nevertheless, those methods leverage only the visual modality to merge video tokens and overlook the correlation between visual and textual queries, which makes it difficult to handle complex question-answering tasks effectively. To address the challenges of long videos and complex prompts, we propose AdaCM$^2$, which, for the first time, introduces an adaptive cross-modality memory reduction approach to video-text alignment in an auto-regressive manner on video streams. Our extensive experiments on various video understanding tasks, such as video captioning, video question answering, and video classification, demonstrate that AdaCM$^2$ achieves state-of-the-art performance across multiple datasets while significantly reducing memory usage. Notably, it achieves a 4.5% improvement across multiple tasks in the LVU dataset with a GPU memory consumption reduction of up to 65%.